Project 4 : Vehicle Silhouette Classification and PCA

Submitted by Aman Kumar Agarwal

Question :

The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgie" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000, and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van, and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Attribute Information

Input variables:

  1. COMPACTNESS (average perim)**2/area

  2. CIRCULARITY (average radius)**2/area

  3. DISTANCE CIRCULARITY area/(av.distance from border)**2

  4. RADIUS RATIO (max.rad-min.rad)/av.radius

  5. PR.AXIS ASPECT RATIO (minor axis)/(major axis)

  6. MAX.LENGTH ASPECT RATIO (length perp. max length)/(max length)

  7. SCATTER RATIO (inertia about minor axis)/(inertia about major axis)

  8. ELONGATEDNESS area/(shrink width)**2

  9. PR.AXIS RECTANGULARITY area/(pr.axis length*pr.axis width)

  10. MAX.LENGTH RECTANGULARITY area/(max.length*length perp. to this)

  11. SCALED VARIANCE ALONG MAJOR AXIS (2nd order moment about minor axis)/area

  12. SCALED VARIANCE ALONG MINOR AXIS (2nd order moment about major axis)/area

  13. SCALED RADIUS OF GYRATION (mavar+mivar)/area

  14. SKEWNESS ABOUT MAJOR AXIS (3rd order moment about major axis)/sigma_min**3

  15. SKEWNESS ABOUT MINOR AXIS (3rd order moment about minor axis)/sigma_maj**3

  16. KURTOSIS ABOUT MINOR AXIS (4th order moment about major axis)/sigma_min**4

  17. KURTOSIS ABOUT MAJOR AXIS (4th order moment about minor axis)/sigma_maj**4

  18. HOLLOWS RATIO (area of hollows)/(area of bounding polygon)

  19. CLASS : Different classes of vehicles.

Part 1 : Data Preprocessing

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

Loading the dataset

In [2]:
dataset = pd.read_csv("D:/Study/AI-ML/Dataset/vehicle.csv")

Overview of the data

In [3]:
dataset.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [4]:
dataset.info() # 846 entries in total, but several attributes have fewer than 846 non-null values, indicating missing values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [5]:
dataset.isnull().sum() # Many attributes contain missing values
Out[5]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [6]:
dataset.describe().T
Out[6]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0

Transforming data types

In [7]:
dataset['class'] = dataset['class'].astype('category')
In [8]:
data = dataset.drop("class",axis=1)

Data Cleaning and Exploratory Data Analysis (EDA)

In [9]:
dataset.plot(kind="box",figsize=(20,10))
# Some columns contain outliers while others do not.
# Missing values in columns with outliers are imputed with the median;
# missing values in columns without outliers are imputed with the mean.
# Remaining outliers are then replaced by the median.
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a20fb0860>
In [10]:
# Attributes 4,5,6,11,12,14,15,16 (1-based) have outliers: treat their missing values with the median.
# Attributes 1,2,3,7,8,9,10,13,17,18 do not: treat their missing values with the mean.
# The lists below hold the corresponding 0-based column positions.
outlier_attributes = [3,4,5,10,11,13,14,15]
non_outlier_attributes = [0,1,2,6,7,8,9,12,16,17]
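The two index lists above were read off the box plots by eye; the same split can also be derived programmatically with the IQR rule used later for capping. A minimal sketch on a hypothetical toy frame (the `has_outliers` helper and its data are illustrative, not part of the original notebook):

```python
import pandas as pd

# Toy frame standing in for the vehicle data (values are hypothetical)
df = pd.DataFrame({"a": [1, 2, 3, 100], "b": [10, 11, 12, 13]})

def has_outliers(s: pd.Series) -> bool:
    """Tukey's rule: any point beyond 1.5*IQR from the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return bool(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).any())

outlier_cols = [c for c in df.columns if has_outliers(df[c])]
non_outlier_cols = [c for c in df.columns if not has_outliers(df[c])]
```

Deriving the lists this way keeps them in sync with the data if the cleaning steps change.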
In [11]:
for i in dataset.columns[outlier_attributes]:
    median = dataset[i].median()
    dataset[i] = dataset[i].fillna(median)
In [12]:
for i in dataset.columns[non_outlier_attributes]:
    mean = dataset[i].mean()
    dataset[i] = dataset[i].fillna(mean)
In [13]:
dataset.isnull().sum()
Out[13]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [14]:
dataset.columns[:18]
Out[14]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio'],
      dtype='object')
In [15]:
# Handling outliers
for col in dataset.columns[:18]:
    q1 = dataset[col].quantile(0.25)
    q3 = dataset[col].quantile(0.75)
    
    iqr = q3 - q1
    
    low = q1 - (1.5*iqr)
    high = q3 + (1.5*iqr)
    
    dataset.loc[(dataset[col]<low) | (dataset[col] > high),col] = dataset[col].median()
In [16]:
dataset.plot(kind="box",figsize=(20,10))

# Outliers have been replaced by the median
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a21852978>

1. Univariate Analysis.

In [17]:
dataset.hist(figsize=(15,15))
plt.show()
# Skewness exists among the different attributes. 
In [18]:
sns.distplot(dataset['compactness'],hist=False) #Almost normal with a little skew
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a21c6b6d8>
In [19]:
fig,axes = plt.subplots(nrows=3,ncols=3,figsize=(15,7))
sns.distplot(dataset['circularity'],hist=False,ax = axes[0][0])
sns.distplot(dataset['distance_circularity'],hist=False,ax = axes[0][1])
sns.distplot(dataset['scatter_ratio'],hist=False,ax = axes[0][2])
sns.distplot(dataset['pr.axis_rectangularity'],hist=False,ax = axes[1][0])
sns.distplot(dataset['scaled_variance'],hist=False,ax = axes[1][1])
sns.distplot(dataset['scaled_radius_of_gyration'],hist=False,ax = axes[1][2])
sns.distplot(dataset['skewness_about'],hist=False,ax = axes[2][0])
sns.distplot(dataset['skewness_about.1'],hist=False,ax = axes[2][1])
sns.distplot(dataset['hollows_ratio'],hist=False,ax = axes[2][2])
# Skewness exists in several of these attributes.
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a22313a20>

2. Bivariate and Multivariate Analysis.

In [20]:
sns.scatterplot(dataset['compactness'],dataset['circularity'],hue=dataset['class'])
# Compactness for vans ranges from roughly 82 to 100, whereas compactness for the
# other vehicle types has a greater spread.
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a22a50898>
In [21]:
sns.scatterplot(dataset['compactness'],dataset['distance_circularity'],hue=dataset['class'])
# Buses have a maximum distance circularity of about 90, whereas it is greater for the other vehicle classes.
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a223b9710>
In [22]:
sns.scatterplot(dataset['scatter_ratio'],dataset['circularity'],hue=dataset['class'])
# Moderately strong linear correlation
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a2241add8>
In [23]:
sns.scatterplot(dataset['scatter_ratio'],dataset['distance_circularity'],hue=dataset['class'])
# Strong positive linear correlation
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a2260c320>
In [24]:
fig,axes = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
sns.scatterplot(dataset['circularity'],dataset['elongatedness'],hue=dataset['class'],ax=axes[0][0])
sns.scatterplot(dataset['distance_circularity'],dataset['elongatedness'],hue=dataset['class'],ax=axes[0][1])
sns.scatterplot(dataset['radius_ratio'],dataset['elongatedness'],hue=dataset['class'],ax=axes[1][0])
sns.scatterplot(dataset['scatter_ratio'],dataset['elongatedness'],hue=dataset['class'],ax=axes[1][1])
plt.show()
# Very strong negative correlation between elongatedness and attributes such as circularity and scatter_ratio.
In [25]:
fig,axes = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
sns.scatterplot(dataset['circularity'],dataset['pr.axis_rectangularity'],hue=dataset['class'],ax=axes[0][0])
sns.scatterplot(dataset['distance_circularity'],dataset['pr.axis_rectangularity'],hue=dataset['class'],ax=axes[0][1])
sns.scatterplot(dataset['scatter_ratio'],dataset['pr.axis_rectangularity'],hue=dataset['class'],ax=axes[1][0])
sns.scatterplot(dataset['elongatedness'],dataset['pr.axis_rectangularity'],hue=dataset['class'],ax=axes[1][1])
plt.show()
# Strong positive correlation between pr.axis_rectangularity and attributes such as circularity and scatter_ratio.
# Strong negative linear correlation between pr.axis_rectangularity and elongatedness.
In [26]:
fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(15,5))
sns.scatterplot(dataset['scatter_ratio'],dataset['max.length_rectangularity'],hue=dataset['class'],ax=axes[0])
sns.scatterplot(dataset['elongatedness'],dataset['max.length_rectangularity'],hue=dataset['class'],ax=axes[1])

# Vans have the lowest scatter ratio relative to max.length rectangularity, followed by cars and buses.
# Conversely, vans have the highest elongatedness relative to max.length rectangularity, followed by cars and buses.
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a21e23978>
In [27]:
fig,axes = plt.subplots(nrows=3,ncols=2,figsize=(15,15))
sns.scatterplot(dataset['compactness'],dataset['scaled_variance'],hue=dataset['class'],ax=axes[0][0])
sns.scatterplot(dataset['circularity'],dataset['scaled_variance'],hue=dataset['class'],ax=axes[0][1])
sns.scatterplot(dataset['distance_circularity'],dataset['scaled_variance'],hue=dataset['class'],ax=axes[1][0])
sns.scatterplot(dataset['radius_ratio'],dataset['scaled_variance'],hue=dataset['class'],ax=axes[1][1])
sns.scatterplot(dataset['scatter_ratio'],dataset['scaled_variance'],hue=dataset['class'],ax=axes[2][0])
sns.scatterplot(dataset['elongatedness'],dataset['scaled_variance'],hue=dataset['class'],ax=axes[2][1])
plt.show()
# Strong positive correlation between scaled_variance and attributes such as circularity and scatter_ratio.
# Strong negative linear correlation between scaled_variance and elongatedness.
In [28]:
sns.pairplot(dataset,diag_kind="kde")
# The attributes are highly linearly related to one another, either positively or negatively.
Out[28]:
<seaborn.axisgrid.PairGrid at 0x20a21bd9198>
In [29]:
sns.pairplot(dataset,diag_kind="kde",hue="class")
# Attributes have a strong positive linear correlation with most other attributes,
# but a strong negative linear correlation with elongatedness.
Out[29]:
<seaborn.axisgrid.PairGrid at 0x20a2e6cb550>
In [30]:
corr = dataset.corr()
plt.figure(figsize=(13,7))
sns.heatmap(corr,annot=True)
# Many attribute pairs show strong positive or strong negative correlation
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a3b2e9e10>

Part 2 : Splitting Data & Model Building

Separating independent and dependent attributes

In [31]:
# Splitting data into independent (X) and target (y) variables
X = dataset.iloc[:,0:18]
y = dataset.iloc[:,-1]

Building model without Principal Component Analysis (PCA)

In [32]:
from sklearn.model_selection import train_test_split # Utility for splitting the dataset
from sklearn.metrics import confusion_matrix         # Utility for evaluating results
from sklearn.metrics import classification_report    # Utility for viewing the classification report
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from scipy.stats import zscore
In [33]:
# Standardizing data (z-score: zero mean, unit variance per column)
X_z = X.apply(zscore)
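`zscore` standardizes each column to x' = (x - mean)/std; strictly speaking this is standardization rather than normalization. A quick equivalence check on a toy frame (the frame itself is an assumption):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy frame standing in for X (values are hypothetical)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

df_z = df.apply(zscore)                     # per-column (x - mean) / std
manual = (df - df.mean()) / df.std(ddof=0)  # ddof=0 matches zscore's population std
```

Standardization matters here because SVMs and PCA are both sensitive to attribute scale.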
In [34]:
X_train, X_test, y_train, y_test = train_test_split(X_z,y,test_size = 0.3, random_state = 10)

-----------------------------------------Support Vector Machine--------------------------------------------

Creating the classifier and Training the model.

In [35]:
classifier_SVM_rbf_kernel = svm.SVC(gamma='auto')
In [36]:
classifier_SVM_rbf_kernel.fit(X_train,y_train)
Out[36]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predicting the output of test data and measuring accuracy score

In [37]:
y_pred_SVM_rbf = classifier_SVM_rbf_kernel.predict(X_test)
In [38]:
svm_score_rbf = classifier_SVM_rbf_kernel.score(X_test,y_test)
svm_score_rbf
Out[38]:
0.9606299212598425

Confusion matrix for overall score

In [39]:
svm_cm = confusion_matrix(y_test,y_pred_SVM_rbf)
svm_cm
Out[39]:
array([[ 70,   0,   1],
       [  0, 120,   5],
       [  1,   3,  54]], dtype=int64)

Classification Report for class-level metrics such as precision and Recall

In [40]:
svm_cr = classification_report(y_test,y_pred_SVM_rbf)
print(svm_cr)
              precision    recall  f1-score   support

         bus       0.99      0.99      0.99        71
         car       0.98      0.96      0.97       125
         van       0.90      0.93      0.92        58

   micro avg       0.96      0.96      0.96       254
   macro avg       0.95      0.96      0.96       254
weighted avg       0.96      0.96      0.96       254
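The per-class figures in the report follow directly from the confusion matrix: precision divides each diagonal entry by its column sum (predicted counts), recall by its row sum (true counts). A quick check against the RBF-kernel matrix above:

```python
import numpy as np

# Confusion matrix of the RBF-kernel SVM above (rows = true, columns = predicted)
cm = np.array([[ 70,   0,   1],
               [  0, 120,   5],
               [  1,   3,  54]])

tp = np.diag(cm)                 # correctly classified counts per class
precision = tp / cm.sum(axis=0)  # bus, car, van
recall = tp / cm.sum(axis=1)
accuracy = tp.sum() / cm.sum()   # -> 0.9606..., matching the score above
```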

Exploring Other Kernels in SVM

In [41]:
classifier_SVM_linear_kernel = svm.SVC(kernel = "linear",gamma='auto')
In [42]:
classifier_SVM_linear_kernel.fit(X_train,y_train)
Out[42]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [43]:
y_pred_SVM_linear = classifier_SVM_linear_kernel.predict(X_test)
In [44]:
svm_score_linear = classifier_SVM_linear_kernel.score(X_test,y_test)
svm_score_linear
Out[44]:
0.9330708661417323
In [45]:
svm_cm2 = confusion_matrix(y_test,y_pred_SVM_linear)
svm_cm2
Out[45]:
array([[ 67,   3,   1],
       [  5, 113,   7],
       [  0,   1,  57]], dtype=int64)
In [46]:
svm_cr2 = classification_report(y_test,y_pred_SVM_linear)
print(svm_cr2)
              precision    recall  f1-score   support

         bus       0.93      0.94      0.94        71
         car       0.97      0.90      0.93       125
         van       0.88      0.98      0.93        58

   micro avg       0.93      0.93      0.93       254
   macro avg       0.92      0.94      0.93       254
weighted avg       0.94      0.93      0.93       254

--------------------------------------------------Naive Bayes--------------------------------------------------------

Creating the classifier and Training the model.

In [47]:
classifier_NB = GaussianNB()
In [48]:
classifier_NB.fit(X_train,y_train)
Out[48]:
GaussianNB(priors=None, var_smoothing=1e-09)

Predicting the output of test data and measuring accuracy score

In [49]:
y_pred_NB = classifier_NB.predict(X_test)
In [50]:
NB_score = classifier_NB.score(X_test,y_test)
NB_score
Out[50]:
0.594488188976378

Confusion matrix for overall score

In [51]:
NB_cm = confusion_matrix(y_test,y_pred_NB)
NB_cm
Out[51]:
array([[15, 16, 40],
       [ 0, 79, 46],
       [ 0,  1, 57]], dtype=int64)

Classification Report for class-level metrics such as precision and Recall

In [52]:
NB_cr = classification_report(y_test,y_pred_NB)
print(NB_cr)
# Low accuracy overall and poor recall for certain classes. We aim to improve this by applying PCA.
              precision    recall  f1-score   support

         bus       1.00      0.21      0.35        71
         car       0.82      0.63      0.71       125
         van       0.40      0.98      0.57        58

   micro avg       0.59      0.59      0.59       254
   macro avg       0.74      0.61      0.54       254
weighted avg       0.78      0.59      0.58       254

Applying PCA

In [53]:
from sklearn.decomposition import PCA
In [54]:
pca = PCA()
In [56]:
X_pca = X_z
In [57]:
pca.fit(X_pca)
cum = np.cumsum(pca.explained_variance_ratio_)
sns.pointplot(x=np.arange(1,19),y=cum)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x20a4529f978>
In [58]:
np.cumsum(pca.explained_variance_ratio_)
# The first 9 components capture about 98% of the variance,
# so the dimensionality can be halved from 18 to 9 with a loss of only ~2% of the information.
# The first 7 components explain more than 95% of the variance.
Out[58]:
array([0.54097838, 0.72697676, 0.79314772, 0.85606307, 0.90509803,
       0.94206901, 0.95970978, 0.9723698 , 0.97962471, 0.9840582 ,
       0.98813282, 0.99172713, 0.99394982, 0.99573126, 0.99735306,
       0.99860504, 0.99966666, 1.        ])
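Instead of reading the cut-off from the cumulative array by eye, scikit-learn can choose it automatically: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic standardized data (the 200x18 shape is an assumption, so the component count will differ from the vehicle data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized data standing in for X_z
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 18))

# Keep the smallest number of components explaining >= 95% of the variance
pca_95 = PCA(n_components=0.95).fit(X_demo)
n_kept = pca_95.n_components_
cum_var = np.cumsum(pca_95.explained_variance_ratio_)[-1]  # >= 0.95 by construction
```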
In [59]:
plt.figure(figsize=(10,6))
sns.barplot(x=np.arange(1,19),y=pca.explained_variance_ratio_,label = "Individual Explained Variance")
plt.step(x = np.arange(1,19),y=cum,where='mid',label = "Cumulative Explained Variance")
plt.ylabel('Explained variance ratio')
plt.xlabel("Principal Components")
plt.legend(loc = 'center right')
plt.show()
In [60]:
pca = PCA(n_components=9)
X_pca = X_z
pca.fit(X_pca)

X_pca = pca.transform(X_pca)
In [61]:
X_pca = pd.DataFrame(X_pca)

Building model with PCA

In [62]:
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca,y,test_size = 0.3, random_state = 10)

Support Vector Machine

Creating the classifier and Training the model.

In [63]:
classifier_SVM_rbf_kernel_pca = svm.SVC(gamma='auto')
In [64]:
classifier_SVM_rbf_kernel_pca.fit(X_train_pca,y_train)
Out[64]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predicting the output of test data and measuring accuracy score

In [65]:
y_pred_SVM_rbf_pca = classifier_SVM_rbf_kernel_pca.predict(X_test_pca)
In [66]:
svm_score_rbf_pca = classifier_SVM_rbf_kernel_pca.score(X_test_pca,y_test)
svm_score_rbf_pca
Out[66]:
0.9606299212598425

Confusion matrix for overall score

In [67]:
svm_cm_pca = confusion_matrix(y_test,y_pred_SVM_rbf_pca)
svm_cm_pca
Out[67]:
array([[ 69,   2,   0],
       [  0, 121,   4],
       [  1,   3,  54]], dtype=int64)

Classification Report for class-level metrics such as precision and Recall

In [68]:
svm_cr_pca = classification_report(y_test,y_pred_SVM_rbf_pca)
print(svm_cr_pca)
              precision    recall  f1-score   support

         bus       0.99      0.97      0.98        71
         car       0.96      0.97      0.96       125
         van       0.93      0.93      0.93        58

   micro avg       0.96      0.96      0.96       254
   macro avg       0.96      0.96      0.96       254
weighted avg       0.96      0.96      0.96       254

Linear Kernel

Creating the classifier and Training the model.

In [69]:
classifier_SVM_linear_kernel_pca = svm.SVC(kernel = "linear",gamma='auto')
In [70]:
classifier_SVM_linear_kernel_pca.fit(X_train_pca,y_train)
Out[70]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predicting the output of test data and measuring accuracy score

In [71]:
y_pred_SVM_linear_pca = classifier_SVM_linear_kernel_pca.predict(X_test_pca)
In [72]:
svm_score_linear_pca = classifier_SVM_linear_kernel_pca.score(X_test_pca,y_test)
svm_score_linear_pca
Out[72]:
0.8976377952755905

Confusion matrix for overall score

In [73]:
svm_cm_pca2 = confusion_matrix(y_test,y_pred_SVM_linear_pca)
svm_cm_pca2
Out[73]:
array([[ 67,   4,   0],
       [  8, 108,   9],
       [  3,   2,  53]], dtype=int64)

Classification Report for class-level metrics such as precision and Recall

In [74]:
svm_cr_pca2 = classification_report(y_test,y_pred_SVM_linear_pca)
print(svm_cr_pca2)
              precision    recall  f1-score   support

         bus       0.86      0.94      0.90        71
         car       0.95      0.86      0.90       125
         van       0.85      0.91      0.88        58

   micro avg       0.90      0.90      0.90       254
   macro avg       0.89      0.91      0.90       254
weighted avg       0.90      0.90      0.90       254

Naive Bayes

Creating the classifier and Training the model.

In [75]:
classifier_NB_pca = GaussianNB()
In [76]:
classifier_NB_pca.fit(X_train_pca,y_train)
Out[76]:
GaussianNB(priors=None, var_smoothing=1e-09)

Predicting the output of test data and measuring accuracy score

In [77]:
y_pred_NB_pca = classifier_NB_pca.predict(X_test_pca)
In [78]:
NB_score_pca = classifier_NB_pca.score(X_test_pca,y_test)
NB_score_pca
Out[78]:
0.8700787401574803

Confusion matrix for overall score

In [79]:
NB_cm_pca = confusion_matrix(y_test,y_pred_NB_pca)
NB_cm_pca
Out[79]:
array([[ 56,   6,   9],
       [  4, 115,   6],
       [  2,   6,  50]], dtype=int64)

Classification Report for class-level metrics such as precision and Recall

In [80]:
NB_cr_pca = classification_report(y_test,y_pred_NB_pca)
print(NB_cr_pca)
              precision    recall  f1-score   support

         bus       0.90      0.79      0.84        71
         car       0.91      0.92      0.91       125
         van       0.77      0.86      0.81        58

   micro avg       0.87      0.87      0.87       254
   macro avg       0.86      0.86      0.86       254
weighted avg       0.87      0.87      0.87       254

Part 3 : Comparing models with and without PCA

Initializing scores arrays

In [81]:
SVM_rbf = []
SVM_linear = []
NB = []

Accuracy scores without PCA

In [82]:
SVM_rbf.append(svm_score_rbf)
SVM_linear.append(svm_score_linear)
NB.append(NB_score)
In [83]:
SVM_rbf,SVM_linear,NB
Out[83]:
([0.9606299212598425], [0.9330708661417323], [0.594488188976378])

Comparing various accuracies with different principal components selection

In [84]:
# Based on the above plot, we start with 7 components, as they explain more than 95% of the variance,
# and progressively observe how each model's accuracy changes as the number of principal components increases.
In [85]:
for n in range(7,19):
    X_pca = X_z # Resetting the X value for each iteration
    pca = PCA(n_components=n) # Initializing pca with n components for each iteration
    pca.fit(X_pca)
    X_pca = pca.transform(X_pca)
    X_pca = pd.DataFrame(X_pca)
    
    #Splitting the transformed X (X_pca) for building the model
    X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca,y,test_size = 0.3, random_state = 10)
    
    # Model 1 - Support Vector Machine
    
    ## RBF Kernel
    classifier_SVM_rbf_kernel_pca = svm.SVC(gamma='auto')
    classifier_SVM_rbf_kernel_pca.fit(X_train_pca,y_train)
    SVM_rbf.append(classifier_SVM_rbf_kernel_pca.score(X_test_pca,y_test))
    
    ## Linear Kernel
    classifier_SVM_linear_kernel_pca = svm.SVC(kernel = "linear",gamma='auto')
    classifier_SVM_linear_kernel_pca.fit(X_train_pca,y_train)
    SVM_linear.append(classifier_SVM_linear_kernel_pca.score(X_test_pca,y_test))
    
    ## Naive Bayes
    classifier_NB_pca = GaussianNB()
    classifier_NB_pca.fit(X_train_pca,y_train)
    NB.append(classifier_NB_pca.score(X_test_pca,y_test))
In [86]:
labels = [0,7,8,9,10,11,12,13,14,15,16,17,18] # 0 denotes the baseline without PCA
values_rbf = list(SVM_rbf)
values_linear = list(SVM_linear)
values_NB = list(NB)
In [87]:
plt.figure(figsize=(15,8))
sns.pointplot(x = labels,y=values_rbf,color="#33FF3C",label="RBF"),
sns.pointplot(x = labels,y=values_linear,color="#4C33FF",label="Linear"),
sns.pointplot(x = labels,y=values_NB,color="#FF4933",label="Naive Bayes")
leg = plt.legend(labels=['RBF','Linear','Naive Bayes'], loc ='center',prop={'size':16})
leg.legendHandles[0].set_color("#33FF3C")
leg.legendHandles[1].set_color("#4C33FF")
leg.legendHandles[2].set_color("#FF4933")
plt.title("Comparing model accuracy without PCA and with PCA\n")
plt.xlabel("Principal Components")
plt.ylabel("Accuracy")
plt.show()
In [88]:
# Considering accuracy alone, 14 components give the maximum accuracy in all three models.
# Considering instead the information content captured by the components, 7 components
# already cover more than 95% of the variance; this gives a good increase in accuracy
# for Naive Bayes but a decrease for the SVM models.
# We therefore choose 9 components, which give a stable, solid accuracy in all three models,
# and reduce the number of dimensions from 18 to 9, a significant reduction.

Part 4 : Conclusion

Insights from EDA

  1. The data has missing values.
  2. The data has outliers.
  3. There is a moderate amount of skewness in the data.
  4. Strong linear correlations exist between many attributes.
  5. These linear correlations are both positive and negative.
  6. Compactness for vans ranges from roughly 82 to 100, whereas compactness for the other vehicle types has a greater spread.
  7. Buses have a maximum distance circularity of about 90, whereas it is greater for the other vehicle classes.
  8. Very strong negative correlation between elongatedness and attributes such as circularity and scatter_ratio.
  9. Strong positive correlation between pr.axis_rectangularity and attributes such as circularity and scatter_ratio.
  10. Strong negative linear correlation between pr.axis_rectangularity and elongatedness.
  11. Vans have the lowest scatter ratio relative to max.length rectangularity, followed by cars and buses.
  12. Conversely, vans have the highest elongatedness relative to max.length rectangularity, followed by cars and buses.
  13. Strong positive correlation between scaled_variance and attributes such as circularity and scatter_ratio.
  14. Strong negative linear correlation between scaled_variance and elongatedness.
  15. The heatmap likewise shows a large number of strong correlations between the attributes.

Model Inferences

Without Applying PCA

  1. The support vector classifier gave an accuracy of 96% with the RBF kernel.
  2. The support vector classifier gave an accuracy of 93.30% with the linear kernel.
  3. The Naive Bayes classifier gave a low accuracy of approximately 60%.

After Applying PCA (7 components)

  1. The support vector classifier gave an accuracy of 96% with the RBF kernel.
  2. The support vector classifier gave an accuracy of 85.40% with the linear kernel.
  3. The Naive Bayes classifier gave a much higher accuracy of 79.92%, up from roughly 60%.
  4. Principal Component Analysis (PCA) therefore helped increase model performance while significantly reducing the dimensionality.

Observations on Component Selection

  1. PCA was run with multiple numbers of components to compare the accuracy and stability of the models as the number of components increases.
  2. With 7 components, more than 95% of the variance in the data is captured.
  3. With 9 components, more than 98% of the variance is captured, allowing the dimensionality to be halved with only about 2% information loss.
  4. The accuracy of the various models also changed as the number of components increased.
  5. Moving from 7 to 9 components, accuracy rose by about 5% (85.4% to 90%) for the SVM linear kernel and by about 8% (79.92% to 87%) for Naive Bayes.
  6. The SVM classifier was tested with different kernels (RBF and linear).
  7. Considering accuracy alone, 14 components give the maximum accuracy in all three models.
  8. Considering information content instead, 7 components already cover more than 95% of the variance; this gives a good increase in accuracy for Naive Bayes but a decrease for the SVM models.
  9. We therefore choose 9 components, which give a stable, solid increase in accuracy in all three models.
  10. PCA thus reduces the number of dimensions from 18 to 9, a significant reduction.

Learnings

  1. Principal Component Analysis (PCA) helps reduce the dimensionality of a dataset in which linear correlations exist between the attributes.
  2. PCA uses eigendecomposition to capture as much of the variance (information content) in the data as possible.
  3. PCA aims for minimal information loss while reducing the dimensionality of the dataset.
  4. What was clearly visible as separate Gaussians in the original dimensions is no longer visible, because each PCA dimension is a composite of the original dimensions.
  5. Each principal component therefore carries a part of the information from every original dimension.
  6. Standardizing the data to a common scale is very important in PCA to prevent any one attribute from dominating.
  7. PCA should only be used when the data is linearly correlated.
  8. Outliers can significantly degrade PCA, so they must be handled while cleaning the data.
  9. In this case, the support vector classifier gives the highest and most stable accuracy of 96%.
  10. We can therefore use a Support Vector Machine for the classification of the vehicle silhouettes.
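The eigendecomposition mechanics mentioned above can be sketched in a few lines of NumPy; this is an illustrative toy (synthetic data, centering only), not the scikit-learn implementation used earlier:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                 # center (standardize in practice)

cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
X_proj = Xc @ eigvecs[:, :2]            # project onto the top-2 components
```

The eigenvectors are the principal axes and the eigenvalues their variances, which is exactly what `explained_variance_ratio_` reports in scikit-learn.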
In [ ]: